Conversation

nielsbauman (Contributor)

We currently compute the shard allocation explanation for every unassigned shard (primaries and replicas) in the health report API when `verbose` is `true`, which includes the periodic health logs. Computing the shard allocation explanation of a shard is quite expensive in large clusters. Therefore, when there are lots of unassigned shards, `ShardsAvailabilityHealthIndicatorService` can take a long time to complete - we've seen cases of 2 minutes with 40k unassigned shards.

To avoid the runtime of `ShardsAvailabilityHealthIndicatorService` scaling linearly with the number of unassigned shards (times the size of the cluster), we limit the number of allocation explanations we compute to `maxAffectedResourcesCount`, which comes from the `size` parameter of the `_health_report` API and currently defaults to `1000` - a follow-up PR will address the high default size. This significantly reduces the runtime of this health indicator and prevents the periodic health logs from overlapping.

A downside of this change is that the returned list of diagnoses may be incomplete. For example, if the `size` parameter is set to `10`, the first 10 unassigned shards are unassigned due to reason `X`, and the remaining unassigned shards due to reason `Y`, only reason `X` will be returned by the health API. We accept this downside because we expect that only a few distinct diagnoses are generally relevant - if more than `size` shards are unassigned, they are likely all unassigned for the same reason. Users can always increase `size` and/or manually call the allocation explain API to get more detailed information.
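
To make the capping concrete, here is a minimal, illustrative Java sketch of the pattern described above. It is not the actual Elasticsearch code; names such as `UnassignedShard`, `Diagnosis`, and `computeDiagnosis` are hypothetical stand-ins, and only `verbose` and `maxAffectedResourcesCount` correspond to names used in this PR:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only, not the actual implementation: cap an expensive
// per-shard computation at a constant number of invocations.
class DiagnosisCapSketch {
    // Hypothetical stand-in types for illustration.
    record UnassignedShard(String index, int shardId) {}
    record Diagnosis(String reason) {}

    static List<Diagnosis> explainUnassigned(List<UnassignedShard> unassignedShards,
                                             boolean verbose,
                                             int maxAffectedResourcesCount) {
        List<Diagnosis> diagnoses = new ArrayList<>();
        if (verbose == false) {
            return diagnoses; // diagnoses are only computed in verbose mode
        }
        for (UnassignedShard shard : unassignedShards) {
            // Stop once the cap is reached so the total cost stays constant
            // instead of growing with the number of unassigned shards.
            if (diagnoses.size() >= maxAffectedResourcesCount) {
                break;
            }
            diagnoses.add(computeDiagnosis(shard));
        }
        return diagnoses;
    }

    // Stand-in for the expensive per-shard allocation explanation.
    static Diagnosis computeDiagnosis(UnassignedShard shard) {
        return new Diagnosis("example reason for " + shard.index() + "[" + shard.shardId() + "]");
    }
}
```

With this shape, `size` (and therefore `maxAffectedResourcesCount`) bounds the number of expensive calls regardless of how many shards are unassigned.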

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine (Collaborator)

Hi @nielsbauman, I've created a changelog YAML for you.

Comment on lines +531 to +539
// Computing the diagnosis can be very expensive in large clusters, so we limit the number of
// computations to the maxAffectedResourcesCount. The main negative side effect of this is that
// we might miss some diagnoses. We are willing to take this risk, and users can always
// use the allocation explain API for more details or increase the maxAffectedResourcesCount.
// Since we have two `ShardAllocationCounts` instances (primaries and replicas), we technically
// do 2 * maxAffectedResourcesCount computations, but the benefit of accurately limiting the
// number of calls doesn't outweigh the added complexity, as the main goal is to limit the
// number of computations to a constant rather than a number that grows with the cluster size.
if (verbose && unassigned <= maxAffectedResourcesCount) {

nielsbauman (Contributor, Author)

Should we clarify any of this in the documentation of the API? I'm inclined to say no, but wanted to bring it up to see if others feel differently.
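
As a rough illustration of the bound mentioned in the quoted comment, the following small Java snippet (a hypothetical helper, not part of the PR) shows why the worst case is `2 * maxAffectedResourcesCount` expensive computations - one capped pass for primaries and one for replicas:

```java
// Illustrative only: primaries and replicas each apply the cap independently, so the
// total number of expensive diagnosis computations is bounded by twice the cap rather
// than by the number of unassigned shards.
class WorstCaseBound {
    static int worstCaseComputations(int unassignedPrimaries, int unassignedReplicas, int maxAffectedResourcesCount) {
        return Math.min(unassignedPrimaries, maxAffectedResourcesCount)
            + Math.min(unassignedReplicas, maxAffectedResourcesCount);
    }

    public static void main(String[] args) {
        // e.g. 40,000 unassigned shards split evenly between primaries and replicas,
        // with the current default cap of 1000:
        System.out.println(worstCaseComputations(20_000, 20_000, 1_000)); // prints 2000
    }
}
```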
